Pulmonary arterial hypertension (PAH). is a disease characterized by pathologically increased pressures in the pulmonary vascular bed. Despite recent treatment advances, the disease remains associated with considerable morbidity and mortality. Currently disease diagnosis and prognosis relies simultaneously on invasive hemodynamic measurements and crude clinical associations. Accordingly, there is considerable interest in developing biomarkers for use in diagnosing, distinguishing subtype, treating and following treatment response in (PAH). Our group has previously looked at plasma protein biomarkers in a small clinical trial population of participants with pulmonary arterial hypertension. The next phase of our project expands on this prior work by examining RNA sequencing in whole blood samples collected from participants at 2 time points during the trial. I will use this project as an opportunity to work through an RNA sequencing workflow correlating RNA to basic demographic variables (age, BMI, gender) prior to work looking more specific markers of interest (6MWD, TAPSE, REVEAL score). This will allow me to work with the dataset without publishling results of interest.
The two faculty/staff I’ve met with regarding this project are Dr. Steven Kawut in Pulmonary Hypertension and Dr. Rui Feng in Biostatistics. Dr. Kawut would be considered my content expert, as he is a master clinician-researcher in PAH, as well as the lead on this clinical trial. He helps to direct appropriate clinical questions being asked of this data. Dr Rui Feng has expertise in RNA sequencing methods as well as statistical methods informing their use. She has been helpful in helping me understand how to format and use appropriate packages for this data.
Pulmonary arterial hypertension (PAH) is a disease of pathologically elevated pressures in the pulmonary arterial bed, leading to right heart failure and ~20% mortality at 3 years. Despite availability of several pharmacologic classes of medications for PAH, treatment algorithms remain largely etiology-agnostic and anchored on disease severity as opposed to disease entity. Tools to help discriminate disease etiology and predict treatment response are needed in order to better tailor treatment algorithms and to improve patient outcomes. We anticipate that the investigation of blood-based biomarkers represents an important means to achieving this goal. Our initial step toward exploring candidate biomarkers in PAH will be a retrospective examination of the baseline and change in expression of a subset of candidate protein biomarkers in an available small clinical trial of patients with PAH. Our findings will help direct future larger and more costly studies to test and if appropriate validate potential candidate protein biomarkers.
Following diagnosis of PAH by right heart catheterization (RHC), it is commonly classified by specific etiology, predominately idiopathic PAH (IPAH, ~40% of cases), followed by connective tissue disease-associated PAH (CTD-PAH, ~20% of cases) and PAH associated with several less common systemic diseases. Although all forms of PAH result from aberrant vascular remodeling and vasoconstriction in the pulmonary arteries, the underlying mechanisms are incompletely understood and likely mediated by each specific disease etiology. With a 4:1 predominance in female patients and increasing evidence for a role of adipose tissue in PAH, hormonal and metabolic pathways are also thought to play significant roles in PAH pathophysiology. This complex and heterogenous mechanistic milieu alongside significant mortality, reliance on invasive means of monitoring disease and an untailored treatment armamentarium makes PAH a disease for which investigation of non-invasive biomarkers is both justifiable and urgent.
Thus, I will assess RNA sequencing patterns by several demographic variables of interest including: RNA expression by active treatment arm (as data derived from clinical trial of medication X), sex, age and BMI, all of which could concievably alter RNA expression. Future aims will be to identify patterns of RNA expression by specific PAH etiology, by severity metrics including TAPSE and 6MWD and response to specific treatments.
3 Methods
Overall Approach: Here I will use a linear model to assess whether there is a correlation between RNA expression and treatment arm, adjusting for sex and visit number in the model. I will be adjusting for both of these in all analyses because the clinical trial used a sex stratification sampling method and because there are two different visit numbers. There are other ways having two visit dates for each subject could be evaluated, namely through linear mixed effects modeling. Alternatively, I could look at the outcome and adjust for corresponding baseline expression level via looping each gene. I ultimately did not use these methods for this project as each analysis was taking several hours to run making it computationally difficult. All models will be subjected to Benjamini-Hochberg adjusted p values to account for multiple comparisons. Lastly, I will also intentionally mask specific gene transcript labels for anonymity.
Aim 1: Differential expression analysis by treatment arm (linear model, binary outcome: active drug X vs placebo).
Aim 2: Differential expression analysis by sex (linear model, binary outcome: self-reported male vs female).
Aim 3: Differential expression analysis by age (linear model, continuous outcome: in years).
Aim 4: Differential expression analysis by BMI (linear model, continuous outcome: in kg/m2).
Aim 5: Complete PCA, UMAP, t-SNE for dimensionality reduction and visualization for whichever analysis above returns most DEGs. Color code by variables of interest. Provide Scree plot for variance and loadings for PCs where relevant.
Aim 6: Complete hierarchical clustering analysis using heatmap for visualization purposes on whichever analysis yields most DEGs.
Aim 7: Complete pathway analysis using GO and Reactome databases for whichever aim returns most DEGs. Will allow for looser p values if needed to model pathways.
Aim 8: Complete enrichment analysis using Gene Set Enrichment Analysis on whichever analysis yields the most DEGs.
4 Load Packages
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
✔ readr 2.1.5
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Attaching package: 'lmerTest'
The following object is masked from 'package:lme4':
lmer
The following object is masked from 'package:stats':
step
clusterProfiler v4.12.6 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
Please cite:
T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan,
X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal
enrichment tool for interpreting omics data. The Innovation. 2021,
2(3):100141
Attaching package: 'clusterProfiler'
The following object is masked from 'package:purrr':
simplify
The following object is masked from 'package:stats':
filter
Loading required package: AnnotationDbi
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:lubridate':
intersect, setdiff, union
The following object is masked from 'package:limma':
plotMA
The following objects are masked from 'package:dplyr':
combine, intersect, setdiff, union
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
tapply, union, unique, unsplit, which.max, which.min
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: IRanges
Loading required package: S4Vectors
Attaching package: 'S4Vectors'
The following object is masked from 'package:clusterProfiler':
rename
The following objects are masked from 'package:Matrix':
expand, unname
The following objects are masked from 'package:lubridate':
second, second<-
The following object is masked from 'package:tidyr':
expand
The following objects are masked from 'package:dplyr':
first, rename
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Attaching package: 'IRanges'
The following object is masked from 'package:clusterProfiler':
slice
The following object is masked from 'package:lubridate':
%within%
The following object is masked from 'package:purrr':
reduce
The following objects are masked from 'package:dplyr':
collapse, desc, slice
Attaching package: 'AnnotationDbi'
The following object is masked from 'package:clusterProfiler':
select
The following object is masked from 'package:dplyr':
select
ReactomePA v1.48.0 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
If you use ReactomePA in published research, please cite:
Guangchuang Yu, Qing-Yu He. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Molecular BioSystems 2016, 12(2):477-479
Loading required package: viridisLite
5 Load Files
6 Transpose and clean RNA count sheet
## TPM dataframe is currently gene names on rows, kit_number on columns. ## Transpose so that kit_numbers are the rows and row names. TPMcounts_t <-as.data.frame(t(TPMcounts))## V1/V2 currently column name, gene name in column 1. Kit_number = row names. ## Replace V1/V2 column name with gene name in column 1. colnames(TPMcounts_t) <- TPMcounts_t[1,]## Remove redundant row with gene name now that its the column title. TPMcounts_t <- TPMcounts_t[-1,]
7 Add kit_numbers to demographics
## Currently "RNAdemo" has 84 observations for patients, whereas "RNAids" has 164 observations for 2 visits * 84 patients with some missingness. ## In order to add kit_number to RNAdemo without losing info, I will duplicate the patient info so as to add kit_numbers corresponding to visit_numRNAdemodup <- RNAdemo[rep(1:nrow(RNAdemo), each =2),]RNAdemodup$visit_num <-rep(c(1,2), nrow(RNAdemo))## Change visit_num = 2 to 3 to match other coding. RNAdemodup$visit_num <-ifelse(RNAdemodup$visit_num ==2, 3, RNAdemodup$visit_num)## Move visit_num up front. RNAdemodup <- RNAdemodup[, c("visit_num", setdiff(names(RNAdemodup), "visit_num"))]## Only save the variables I plan to work with (too many variables):# Subset only the desired columnsclininfo <- RNAdemodup[, c("pt", "visit_num", "age_der", "gender", "arm", "BMI")]clininfo <-merge(clininfo, RNAids[, c("pt", "visit_num", "kit_number")],by =c("pt", "visit_num"), all.x =TRUE)clininfo <- clininfo[, c("kit_number", setdiff(names(clininfo), "kit_number"))]
8 Remove kits that didn’t meet QC and NAs (n = 6+ 4= 10). Complete Case Analysis.
## QC - kit numbers that did not meet QC kit_numbers_to_remove <-c("1102", "1105", "1110", "1111", "1503", "1607")clininfo <- clininfo |>filter(!kit_number %in% kit_numbers_to_remove)## Leaves me with 162 rows (6 removed)- 4 NAs- remove since we wont have RNA data for these non-existant kit_numbers. sum(is.na(clininfo$kit_number))
## Visualize TPM data# Combine all values into a single numeric vectorall_values <-as.numeric(unlist(TPMcounts_t))
Warning: NAs introduced by coercion
ggplot(data.frame(all_values), aes(x = all_values)) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Original TPM Data", x ="TPM Values", y ="Frequency") +theme_minimal()
Warning: Removed 57500 rows containing non-finite outside the scale range
(`stat_bin()`).
## Many zeros, add constant 1 to avoid log(0)## Remove gene nameTPMcounts_t <- TPMcounts_t[-nrow(TPMcounts_t),]# make numeric for log transformation TPMcounts_t[] <-lapply(TPMcounts_t, function(x) as.numeric(as.character(x)))## log transform TPM to work with limma, stabilize varianceTPMcounts_t_log2 <-log2(TPMcounts_t +1)## Visualize log data: all_values_log <-as.numeric(unlist(TPMcounts_t_log2))ggplot(data.frame(all_values_log), aes(x = all_values_log)) +geom_histogram(binwidth =1, fill ="skyblue", color ="black") +labs(title ="Log2 Transformed TPM Distribution", x ="TPM Values", y ="Frequency") +theme_minimal()
10 Combine log-TPM RNA expression with clinical info.
## Add kit_number:## Extract kit_number from row names into a column for merging purposes. ## Make kit_number the first column. TPMcounts_t_log2$kit_number <-rownames(TPMcounts_t_log2)TPMcounts_t_log2 <- TPMcounts_t_log2[, c("kit_number", setdiff(colnames(TPMcounts_t_log2), "kit_number"))]# Remove x from kit_number in TPMsheet to match clinical. TPMcounts_t_log2$kit_number <-gsub("^X", "", TPMcounts_t_log2$kit_number)## Merge log- TPM RNA expression with clinical info: merged <-merge(clininfo, TPMcounts_t_log2, by ="kit_number")# Recode arm from A/B to 1/2:merged$arm <-ifelse(merged$arm =="A", 1, ifelse(merged$arm =="B", 2, merged$arm))
11 Update Expression Matrix for Use in DEG
## Format logTPMcounts to expression matrix (gene name as row names)TPMcounts_t_log2_t <-t(TPMcounts_t_log2)logexpressionmatrix <- TPMcounts_t_log2_tlogexpressionmatrix <-as.matrix(logexpressionmatrix)
12 Results
13 Analysis 1: DEG for Treatment Arm Adjusted by Sex and Visit
There were no significantly differentially expressed RNA transcripts by treatment arm after adjusting for sex and visit.
Warning: Zero sample variances detected, have been offset away from zero
14 Analysis 2: Differential Expression by Sex Adjusted by Visit
There were several differentially expressed RNA transcripts by sex after adjusting for visit number.
## Does expression of any RNA transcripts differ by sex (binary, self-reported male vs. female)?## Male = 1/Female =2 ## Define Variables of Interest, Analysis DFtreatment <-factor(merged$arm)gender <-factor(merged$gender)visit <-factor(merged$visit_num)analysis2 <-model.matrix (~ gender + visit)colnames(analysis2) <-c("Intercept", "Gender2vs1", "Visit3vs1")## Fit Linear Modelfit_g2 <-lmFit(logexpressionmatrix, analysis2)## Apply Empirical Bayes Moderation: Stabilizes variance fit_g2 <-eBayes(fit_g2)
Warning: Zero sample variances detected, have been offset away from zero
## Fit Resultsresults_g2<-topTable(fit_g2, coef ="Gender2vs1", adjust.method ="BH", number =Inf)## Filter Significance by Benjamini-Hochberg Correction for Multiple Comparisonsresults_g2$Significant <-ifelse(results_g2$adj.P.Val <=0.05, "Significant", "Not Significant")## Color Coding:results_g2$colorcoding <-ifelse( results_g2$Significant =="Significant"& results_g2$logFC >0, "Pink (Overrepresented in Female)",ifelse(results_g2$Significant =="Significant"& results_g2$logFC <0, "Blue (Overrepresented in Male)", "Not Significant"))## Volcano Plot for Sex: ggplot(results_g2, aes(x = logFC, y =-log10(adj.P.Val), color = colorcoding)) +geom_point(alpha =0.8, size =2) +scale_color_manual(values =c("Pink (Overrepresented in Female)"="hotpink4","Blue (Overrepresented in Male)"="cornflowerblue","Not Significant"="gray" ) ) +labs(title ="Differential Expression by Sex",x ="Log2 Fold Change", y ="-Log10 BH-Adjusted P-value" ) +theme_minimal() +theme(legend.title =element_blank(), legend.position ="right")
15 Aim 3: Differential Expression by Age Adjusted by Sex and Visit
There were 379 differentially expressed RNA transcripts by age after adjusting for sex and visit.
## make sure gene names are carried forward genenames <-colnames(merged)[8:57507]rownames(logexpressionmatrix) <- genenames## Does expression of any RNA transcripts differ by age (continuous, in years)?## Define Variables of Interest, Analysis DFage <- merged$agegender <-factor(merged$gender)visit <-factor(merged$visit_num)analysis3 <-model.matrix (~ age + gender + visit)colnames(analysis3) <-c("Intercept", "Age", "Gender2vs1", "Visit3vs1")## Fit Linear Modelfit_h3 <-lmFit(logexpressionmatrix, analysis3)## Apply Empirical Bayes Moderation: Stabilizes variance fit_h3 <-eBayes(fit_h3)
Warning: Zero sample variances detected, have been offset away from zero
## Fit Results## Dont sort so you can add gene names backresults_h3 <-topTable(fit_h3, coef ="Age", adjust.method ="BH", number =Inf, sort.by ="none")results_h3$Gene <-rownames(fit_h3)## Filter Significance by Benjamini-Hochberg Correction for Multiple Comparisonsresults_h3$Significant <-ifelse(results_h3$adj.P.Val <=0.05, "Significant", "Not Significant")## Color Coding:results_h3$colorcoding <-ifelse( results_h3$Significant =="Significant"& results_h3$logFC >0, "Pink (Overrepresented in Young)",ifelse(results_h3$Significant =="Significant"& results_h3$logFC <0, "Blue (Overrepresented in Old)", "Not Significant"))## Volcano Plot for Age: ggplot(results_h3, aes(x = logFC, y =-log10(adj.P.Val), color = colorcoding)) +geom_point(alpha =0.8, size =2) +scale_color_manual(values =c("Pink (Overrepresented in Young)"="hotpink4","Blue (Overrepresented in Old)"="cornflowerblue","Not Significant"="gray" ) ) +labs(title ="Differential Expression by Age",x ="Log2 Fold Change", y ="-Log10 BH-Adjusted P-value" ) +theme_minimal() +theme(legend.title =element_blank(), legend.position ="right")
16 Analysis 4: Differential Expression by BMI Adjusted by Sex and Visit
There were 8 differentially expressed RNA transcripts by BMI after adjusting for sex and visit.
## Does expression of any RNA transcripts differ by BMI (continuous, in kg/m2)?## Define Variables of Interest, Analysis DFbmi <- merged$BMIgender <-factor(merged$gender)visit <-factor(merged$visit_num)analysis4 <-model.matrix (~ bmi + gender + visit)colnames(analysis4) <-c("Intercept", "BMI", "Gender2vs1", "Visit3vs1")## Fit Linear Modelfit_i4 <-lmFit(logexpressionmatrix, analysis4)## Apply Empirical Bayes Moderation: Stabilizes variance fit_i4 <-eBayes(fit_i4)
Warning: Zero sample variances detected, have been offset away from zero
## Fit Resultsresults_i4<-topTable(fit_i4, coef ="BMI", adjust.method ="BH", number =Inf)## Filter Significance by Benjamini-Hochberg Correction for Multiple Comparisonsresults_i4$Significant <-ifelse(results_i4$adj.P.Val <=0.05, "Significant", "Not Significant")## Color Coding:results_i4$colorcoding <-ifelse( results_i4$Significant =="Significant"& results_i4$logFC >0, "Pink (Overrepresented in Lower BMIs)",ifelse(results_i4$Significant =="Significant"& results_i4$logFC <0, "Blue (Overrepresented in Higher BMIs)", "Not Significant"))## Volcano Plot for Age: ggplot(results_i4, aes(x = logFC, y =-log10(adj.P.Val), color = colorcoding)) +geom_point(alpha =0.8, size =2) +scale_color_manual(values =c("Pink (Overrepresented in Lower BMIs)"="hotpink4","Blue (Overrepresented in Higher BMIs)"="cornflowerblue","Not Significant"="gray" ) ) +labs(title ="Differential Expression by BMI",x ="Log2 Fold Change", y ="-Log10 BH-Adjusted P-value" ) +theme_minimal() +theme(legend.title =element_blank(), legend.position ="right")
17 Principal Components Analysis for All Datapoints
PCA, UMAP and t-SNE were all used on the significant age gene transcript sets in order to visualize trends. Color coding was used to visualize how variables of interest related to dimensionality vectors. Age demonstrated the most visually compelling trend, as would be expected on age dataset.
## PCA for Age: ## Extract significantly different age (3325 sig genes)sigagegenes <- results_h3[results_h3$Significant =="Significant", ]sigagegenenames <-rownames(sigagegenes)sigagematrix <- logexpressionmatrix[sigagegenenames, ]sigagematrix <-t(sigagematrix)## Center and scale:PCAcolumns1<- sigagematrixPCAresults1 <-prcomp(PCAcolumns1, center =TRUE, scale =TRUE)PCAscores1 <-as.data.frame(PCAresults1$x)ggplot(PCAscores1, aes(x = PC1, y = PC2)) +geom_point(color ="cornflowerblue", alpha =0.7) +labs(title ="PCA Scatterplot (PC1 vs PC2)", x ="PC1", y ="PC2") +theme_minimal()
loadings2 <-as.data.frame(PCAresults2$rotation)ggplot(loadings2, aes(x =factor(rownames(loadings2)), y = PC1)) +geom_bar(stat ="identity", fill ="darkorange") +labs(title ="Variable Loadings for PC1", x ="Variables", y ="Loading") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
ggplot(loadings2, aes(x =factor(rownames(loadings2)), y = PC2)) +geom_bar(stat ="identity", fill ="lightyellow") +labs(title ="Variable Loadings for PC2", x ="Variables", y ="Loading") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
## Color the PCA by Age for easier visualization # Extract the age variable for the same samplessampleage <-rownames(PCAcolumns2) # Samples in PCAage_sub <- merged$age[match(sampleage, merged$kit_number)] # Match age by sample namesPCAscores2 <-as.data.frame(PCAresults2$x)PCAscores2$age <- age_subggplot(PCAscores2, aes(x = PC1, y = PC2, color = age)) +geom_point(size =3, alpha =0.7) +scale_color_gradient(low ="blue", high ="red") +labs(title ="PCA Scatterplot (Colored by Age)",x ="PC1",y ="PC2",color ="Age" ) +theme_minimal()
# Create age groupsPCAscores2$age_group <-ifelse(PCAscores2$age <50, "Under 50", "Over 50")ggplot(PCAscores2, aes(x = PC1, y = PC2, color = age_group)) +geom_point(size =3, alpha =0.7) +scale_color_manual(values =c("Under 50"="blue", "Over 50"="red")) +labs(title ="PCA Scatterplot (Under 50 vs. Over 50)",x ="PC1",y ="PC2",color ="Age Group" ) +theme_minimal()
19 UMAP for Age, Followup Visit Only
umapvisit3<-umap(PCAcolumns2)umapvisit3results <- umap::umap(PCAcolumns2, n_neighbors =15, min_dist =0.1, metric ="euclidean")umapvisit3df <-as.data.frame(umapvisit3results$layout)colnames(umapvisit3df) <-c("UMAP1", "UMAP2")ggplot(umapvisit3df, aes(x = UMAP1, y = UMAP2, color = PCAscores2$age)) +geom_point(size =3, alpha =0.7) +scale_color_gradient(low ="blue", high ="red") +labs(title ="UMAP Visualization (Colored by Age)",x ="UMAP1",y ="UMAP2",color ="Age" ) +theme_minimal()
20 UMAP on Entire Dataset
## UMAP on full dataset, colored by various conditionsjoined <- merged[merged$kit_number %in%colnames(logexpressionmatrix), ]joined <- joined[match(colnames(logexpressionmatrix), joined$kit_number), ]log_t <-t(logexpressionmatrix)umapall<-umap(log_t, n_neighbors =15, min_dist =0.1, metric ="euclidean")umap_df <-as.data.frame(umapall$layout)colnames(umap_df) <-c("UMAP1", "UMAP2")umap_df$kit_number <- joined$kit_number umap_df <-cbind(umap_df, joined)umap_df <- umap_df[, !duplicated(colnames(umap_df))]## Ageggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = age_der)) +geom_point(size =3, alpha =0.7) +labs(title ="UMAP- Entire Dataset",x ="UMAP1",y ="UMAP2",color ="Age" ) +theme_minimal()
## Arm ggplot(umap_df, aes(x = UMAP1, y = UMAP2, color = arm)) +geom_point(size =3, alpha =0.7) +scale_color_manual(values =c("1"="cornflowerblue", "2"="darkorange") ) +labs(title ="UMAP- Entire Dataset",x ="UMAP1",y ="UMAP2",color ="Treatment Arm" ) +theme_minimal()
Performing PCA
Read the 158 x 50 data matrix successfully!
Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
Computing input similarities...
Building tree...
Done in 0.01 seconds (sparsity = 0.745313)!
Learning embedding...
Iteration 50: error is 56.087173 (50 iterations in 0.01 seconds)
Iteration 100: error is 60.914188 (50 iterations in 0.01 seconds)
Iteration 150: error is 63.351929 (50 iterations in 0.01 seconds)
Iteration 200: error is 61.041334 (50 iterations in 0.01 seconds)
Iteration 250: error is 61.019561 (50 iterations in 0.01 seconds)
Iteration 300: error is 1.469315 (50 iterations in 0.01 seconds)
Iteration 350: error is 0.973240 (50 iterations in 0.01 seconds)
Iteration 400: error is 0.810475 (50 iterations in 0.01 seconds)
Iteration 450: error is 0.711370 (50 iterations in 0.01 seconds)
Iteration 500: error is 0.645050 (50 iterations in 0.01 seconds)
Fitting performed in 0.09 seconds.
Demonstrates hierarchical clustering for those transcripts differentially expressed by age in participants with PAH. Number of transcripts shown both overall at 379 as well as the top 10 significant age related genes for ease of visualization. Gene labels removed for anonymity.
## Extract Genes from Prior Analysisvalid_genes <- sigagegenenames1[sigagegenenames1 %in%rownames(logexpressionmatrix)]valid_genes <- valid_genes[valid_genes %in%rownames(logexpressionmatrix)]expressheat <- logexpressionmatrix[valid_genes, ]expressheat <-as.matrix(expressheat)## All 379 Significant Age Genespheatmap( expressheat, cluster_rows =TRUE, cluster_cols =TRUE, scale ="row", show_rownames =FALSE,show_colnames =FALSE,main ="Heatmap of 379 Significant Age Related Genes")
# Top 10 Genes only valid10 <- valid_genes[1:10]expressheat10 <- logexpressionmatrix[valid10, ]expressheat10 <-as.matrix(expressheat10) pheatmap( expressheat10, cluster_rows =TRUE, cluster_cols =TRUE, scale ="row", show_rownames =FALSE,show_colnames =FALSE,main ="Heatmap of Top 10 Significant Age Related Genes")
25 Functional Gene Enrichment Analysis for Age
There is significant enrichment for the pathway “Xenobiotic Metabolism” in the significantly differentially expressed RNA transcripts by age.
## obtain list of significant age related genes from prior analysis, make sure gene label is present and sortone <- results_h5[results_h5$Significant =="Significant", ]one$Gene <-rownames(one)rankage <- one$logFCnames(rankage) <- one$Generankage <-sort(rankage, decreasing =TRUE)## update such that all symbols match ensemblconvert<-bitr(names(rankage), fromType ="ENSEMBL", toType ="SYMBOL", OrgDb = org.Hs.eg.db)
'select()' returned 1:many mapping between keys and columns
Warning in bitr(names(rankage), fromType = "ENSEMBL", toType = "SYMBOL", :
26.65% of input gene IDs are fail to map...
After log-transforming TPM RNA sequencing transcripts and adjusting for sex and visit number, age generated the highest number of differentially expressed genes within a population of clinical trial participants with pulmonary arterial hypertension (PAH). Treatment arm did not demonstrate significantly deferentially expressed genes. Sex (adjusted only for visit number) and BMI demonstrated comparatively fewer deferentially expressed RNA transcripts. Those RNA transcripts that were significantly up or downregulated in the age differential expression analysis using a Benjamini Hochberg approach to account for multiple comparisons were then analyzed and/or subjected to several commonly employed visualization methods in RNA seq projects. I first completed a principal components analysis on both the entire dataset and just the followup set. The loadings for these would allow one to identify specific transcripts of interest. I then generated PCAs, UMAPs and t-SNEs for dimensionality reduction and color coded by variables of interest to visualize how these may or may not relate to the vectors yielded by various techniques. A pathway analysis revealed that age related transcripts were most heavily related to immunologic biologic pathways. A heatmap helped to visualize hierarchical clustering within age pathways for participants with PAH. Lastly, a gene enrichment analysis showed enrichment of age-related RNA sequencing transcripts within the “Xenobiotic Metabolism” pathway.